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Abstract 

Motivation: Variation calling is the process of detecting differences between donor and consensus DNA via high- 
throughput sequencing read mapping. When evaluating the performance of different variation calling methods, a 
typical scenario is to simulate artificial (diploid) genomes and sample reads from those. After variation calling, one 
can then compute precision and recall statistics. This works reliably on SNPs but on larger indels there is the 
problem of invariance: a predicted deletion/insertion can differ slightly from the true one, yet both make the same 
change to the genome. Also exactly correct predictions are rare, especially on larger insertions, so one should 
consider some notion of approximate predictions for fair comparison. 

Results: We propose a full genome alignment-based strategy that allows for fair comparison of variation calling 
predictions: First, we apply the predicted variations to the consensus genome to create as many haploid genomes 
as are necessary to explain the variations. Second, we align the haploid genomes to the (aligned) artificial diploid 
genomes allowing arbitrary recombinations. The resulting haploid to diploid alignments tells how much the 
predictions differ from the true ones, solving the invariance issues in direct variation comparison. In an effort to 
make the approach scalable to real genomes, we develop a simple variant of the classical edit distance dynamic 
programming algorithm and apply the diagonal doubling technique to optimise the computation. We experiment 
with the approach on simulated predictions and also on real prediction data from a variation calling challenge. 



Background 

In the study of human genetics, variation calling from 
high-throughput sequencing reads [1] is a revolutionary 
technique. Conceptually, the process is remarkably simple. 
Sequence random short fragments from donor DNA and 
align them to the reference (consensus) genome. Outside 
repetitive regions a good alignment is unique, hence if the 
resulting multiple alignment (read pileup) has columns 
where reads vote for something differing from the refer- 
ence genome, the donor is very likely to actually contain 
this variant in his/her genome. Numerous methods have 
been proposed to fine-tune this simple scheme. One could 
argue that this problem is actually an enormous local mul- 
tiple alignment problem, and so it is not surprising that 
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methods trying to locally improve the alignment are able 
to improve the accuracy. This standard approach only cap- 
tures small scale variations, and the methods for discover- 
ing large insertions, deletions, translocations, reversals, 
etc., are more involved [2]. 

Surprisingly there has been no systematic approach for 
studying the accuracy issue of variation calling predictions. 
With real data sets one can always resort to Sanger 
sequencing to validate the findings, but this is expensive. 
With simulated data sets one can compare to the ground 
truth and compute precision/recall statistics. A typical 
simulated data set is produced by first applying random 
mutations to a reference genome or selecting a random 
subset of a predefined set of known frequent variations 
in the population (or a mixture of these), to create an 
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artificial donor genome. We consider here the setting of a 
diploid genome, where this simulation is conducted twice, 
taking into account heterozygousity/homozygousity infor- 
mation on the variants to create a realistic diploid genome. 
What we assume is that the alignment of the diploid gen- 
ome to the reference is preserved. Then one can simulate 
a random pool of short reads from the artificial diploid 
genome with high enough coverage such that variation 
calling is plausible. Once a list of predicted variants is pro- 
duced, one can match it to the list of variants used for 
creating the artificial diploid. 

However, this direct comparison has the shortcoming 
that invariances among predictions are not properly 
considered. To see this, consider reference ACGGAAGGT, 
donor ACGGT, and let the simulated real variant be dele- 
tion of GAAG at position 3 (starting indexing from 0). 
Predicted deletion AAGG at position 4 results in the 
same donor, but pair-wise comparison of predicted and 
real variants would not reveal this. While these kind of 
simple invariances are easy to take into account, things 
get much more involved with split/merged predictions 
of various types. Clearly some kind of dynamic program- 
ming approach would be required to find best editing of 
the predicted variants to make them match the real 
ones. Here we assume that the allowed variant types are 
only insertions, deletions and substitutions, since rear- 
rangements can always be modelled as series of the for- 
mer type of operations. 

Instead of correcting for the invariances in the manner 
sketched above, we propose to resort to a natural variant 
of the familiar sequence level full-genome alignment. The 
approach is as follows. First apply the predicted variations 
to the consensus genome to create as many haploid gen- 
omes that are necessary to explain the variations. Hetero- 
zygous variations are only applied on one haploid and 
homozygous variations on all. Second, align the haploid 
genomes to the (aligned) artificial diploid genomes allow- 
ing arbitrary recombinations. The intuition is that with 
short read mapping one cannot usually obtain much hap- 
lotype level information, but only determine whether the 
variations are heterozygous or homozygous. The haploid 
genome created by applying all non-overlapping predicted 
variants is better the closer it is to an arbitrary recombina- 
tion of the diploid alleles. This is what the approach mea- 
sures by edit distance. If there are overlapping predicted 
heterozygous variants, multiple haploid genomes are cre- 
ated and compared with a (different) recombination of the 
diploid alleles. Hence, the resulting haploid to diploid 
alignments tell how much the predictions differ overall 
from the true ones, solving the invariance issues in direct 
variation comparison. To make the approach scalable to 
real genomes, we develop a simple variant of the classical 
edit distance dynamic programming algorithm and apply 



the diagonal doubling technique to optimise the computa- 
tion. We experiment with the approach on simulated pre- 
dictions and on real prediction data from a variation 
calling challenge [3]. 

Methods 

Problem 

Let B 1 and B 2 represent diploid genome sequences gener- 
ated by applying mutations to a reference sequence P. Let 
tables m 1 and m 2 store for each position in B 1 and B 2 the 
corresponding position in P defined by the multiple 
alignment of P, B 1 , and B 2 , which in turn is defined by 
how P is mutated to B 1 and B 2 . For example, the multiple 
alignment 

0123456 7 8 

P AGCTGAT-A-C 

Bl ACCTGATCACG 

B2 ACCTGCT-ACC 

is defined by homozygous substitution G- >C at 1st posi- 
tion of P , heterozygous substitution A - >C at 5th position 
of P , heterozygous insertion of C after 6th position of P, 
homozygous insertion of C after 7th position of P, and 
heterozygous substitution C- >G at 8th position of P. The 
corresponding mappings are 

ml 01234566778 

m2 0123456778 

After simulating reads from B 1 and B 2 , let some varia- 
tion calling method end up with the above list of varia- 
tions. Then one can construct a haploid genome A by 
applying greedily all predicted mutations to P from left 
to right. In the general case, one could be left with over- 
lapping mutations. Then one should construct another 
haploid genome applying all homozygous mutations as 
well; it is possible that there are more overlaps because 
of prediction errors [4], but since our simulated ground- 
truth is diploid, generating more predicted genomes is 
not beneficial in this case. 

In our running example, all mutations can be applied 
simultaneously, as there are no overlapping heterozygous 
mutations. The result is A = ACCTGCTCACC, which can 
be aligned to the above multiple alignment with no errors 
by allowing the alignment to recombine B 1 and B 2 at the 
positions they share in common with P. We call such a 
recombination a valid reference guided recombination. 
The result is shown below (lower case showing the align- 
ment of A). 

A ACCTGCTCACC 

Bl ACCTGATcaCG 

B2 acctgct-Acc 
Invariance 

To highlight the issue of invariance, let us consider a dif- 
ferent reference P = GATCAATGAG, a single diploid B 
and a haploid A given by some variation calling method. 
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Let us assume that B and A differ from P by the following 
variations 



position P B 


position P A 


0 GAG 


1 


A T 


3 C CC 


2 


T C 


4 AC 


3 


C CC 


7 G A 


3 


CAC 




7 


G A 



The evaluation method used in Variathon 2013 [3] finds 
only two common variations: the one in position 3 and 
the one in position 7. Even if one was to use a method to 
match two invariant variants that make the same effect to 
the reference, this would not help in this example; only 
when the variants are applied altogether, the result is the 
same genome GTCCCATAAG: 

P GATC - AATGAG P GATC-AATGAG 

B G-TCCCATAAG A GTCCC-ATAAG 

To allow approximate predictions and account for invar - 
iance, we propose to compute the unit cost (Levenshtein) 
edit distance between A and any recombination B 1 and B 2 
according to their alignment 

The computation of such edit distance is an easy exten- 
sion of the standard dynamic programming, and has been 
studied earlier under name jumping alignments in the 
context of amino acid sequences [5]. However, global 
alignment of full genomes is infeasible using standard 
quadratic time dynamic programming. Therefore we 
extend the diagonal doubling optimisation [6] to compute 
the edit distance in O {dri) time, where d is the least edit 
distance between the haploid and any valid reference 
guided recombination of the diploid genomes, and n is the 
maximum of the input sequence lengths. 

Algorithm 

We simultaneously fill two dynamic programming matrices 
{d}^ ) and {dfj 2 ) such that [d]^ ) (resp. (<2?- 2 )) gives the mini- 
mum edit distance between A[0 ... i] and R[0 ... j] such that 
R[0 ... jl is a valid reference guided recombination of B 1 
[0 ... ji] and B 2 [0 ... j 2 ] ending at B 1 (resp. ending at B 2 ). Fill- 
ing the matrices is an easy extension of standard dynamic 
programming for edit distance: take minimum across the 
matrices at columns ji and j 2 such that B 1 and B 2 can 
recombine. Such j\ and ji values have the property that 

ml \ji ~ 1 ] = 17,2 \]2 — 1 ] = j or 
m l \ji - 1] = m 2 \j 2 ] - 1 = j 

and 

m l \ji -2] f]f m 2 \j 2 - 2] 

This results into algorithm 1, where call d = diploid align 
(A, Bl, B2, m , m , reference length) returns the minimum 



edit distance d between A and any valid reference guided 
recombination of B 1 and B 2 . 

Proof of correctness for Algorithm 1 

Definition 1. Let A and B be strings. Now ed{A, B) is 
the Levenshtein edit distance between the strings A and B. 

Definition 2. Let A, B 1 and B 2 be strings. Rj is the refer- 
ence guided recombination of fi 1 [0 ... k] and B 2 [0 ... v] that 
minimises ed(A, Rj), where k = max{/ | m l [i\ < j} (likewise 
for v). Rj is as Rj but the last character must be from the 
string B . Rj is denned similarly. 

Lemma 1. Let A, B 1 and B 2 be strings and Rj, Rj and Rj 
as above. Now 



ed(Aj, Rj) = min 



ed(Aj, Rj) 
ed(Aj, Rj). 



Algorithm 1 Haploid to diploid alignment algorithm, 
function DIPLOID _ALIGN(A, B 1 , B 2 , m\ m 2 , 

reference_length) 

^[lengthfE 1 ) + l][length(A) + 1] 

<i 2 [lengthCB 2 ) + l][length(A) + 1] 
/* Initialise as in Levenshtein distance. */ 
function CALCULATE_COLUMN(w«£ra:, col, 

B_char) 

for i <- 1, length(A) do 



matrix[col]\i] <- min 



matrix[col — l[i] + 1] 

matrix\col — l[i — 1] + [A[i] f Bjchar] 

matrix[col][i — 1] + 1 



function MIN_FROM_TO(/h>m, to) 
for i = 1 -» length(A) do 



to\i\ <— min 



to\i\ 
from[i] 



columnl <— 1 
column2 <— 1 

for /' <— 0, reference_length - 1 do 
if m 1 [columnl - 1] = j then 

CALCULATE_COLUMN((i 1 , columnl, B 1 [columnl 
- 1]) 

columnl <— columnl + 1 
if m 2 [column2 - 1] = j then 

CALCULATE_COLUMN(^ 2 , columnl, B 2 [column2 
-1]) 

column2 <— column2 + 1 
if m 1 [columnl - 2] = j and (m 2 [column2 - 1] = j + 
1 or m 2 [column2 - 2] = j) then 

MIN.FROMJTO^ 1 [columnl - I], d 2 [column2 

-1]) 

if m 2 [column2 - 2] = j and (m 1 [columnl - 1] = j + 
1 or m 1 [columnl - 2] = j) then 

MIN_FROM_TO(J 2 [co/ M w«2 - 1], d l [columnl 

- 1]) 
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while m 1 [columnl - 1] < /' do 
CALCULATEJIOLUMN^ 1 , columnl, B\columnl 
-1]) 

columnl <— columnl + 1 
while m 2 [column2 - 1] < /' do 
CALCULATE_COLUMN(aP, columnl, B 2 [column2 
-1]) 

columnl <— columnl + 1 



return min 



d 1 [length^ 1 )] [length(A)] 



[ d 2 [length (B 2 )] [length (A)] 
Lemma 2. When a recombination 

B 2 [0...v-2]B 1 [k- 1...] 
is possible, that is, 

m 2 [v — 2] + 1 = m x [fe — 1] or m 2 [u — 2] =m l [k — 2], 
it holds that 

ed(A,vR^,„_ 2] ) 
'edCA,-!,^,^,) 

= ed(A„Ri, 1|fe _ 1| ) 

where 3=1 if A[i] * B l [k - 1] and 0 otherwise. 
Proof. First we observe that 

m 2 [v - 2] + 1 = m 1 [fe - l] 

implies 

m 1 [fe - 2] < m 2 [u - 2] = m 1 [fe - l] - 1. 
Using this we get 

ed(A„ R^i[ fe _ 2] ) 

ed^R^,) 
edfA^R^,) 
ed(A,_ 1( R^^p + l 

edCA^R^,^,) 
ed(A„R 2 m2|u _ 2] ) 
edfA^R^,) 
edfA.^R^,) 
ed(Ai_ 1( R^p + l 

and by Lemma 1 

ed (A, -R m 2 |u _ 2 |) + 1 
ed(A i _ 1 , -R m 2[„_ 2 ]) + 5 
ed(A,_!, R^irfc-i]) + 1 

ed^RVife-i]) 



which is as claimed. Note that if m 2 [v - 2] = m\k - 2], 
the claim follows directly without the observation. 
Lemma 3. If the strings B 1 and B 2 do not overlap, that is 

m 1 [length (B 1 ) - l] < m 2 [0] or 
m [length (B 2 ) - l] < m 1 [0] 

then we can compute the diploid_align as the 
Levenshtein distance between the string A and the 
appropriate concatenation B l B 2 or B 2 B l . 

Theorem 1. The Algorithm 1 calculates 



d 1 [fe][f] = ed(^,R^ i]) )and 
^M^ed^R 2 ^^). 

Proof. The proof is by induction over the index sum 
i + k for the case 

d l [k][i\ = ed^R 1 ^-!]))- 

The other case is symmetric. 

First consider the base cases. The algorithm initialises 
the matrices as in the Levenshtein distance algorithm. 
By Lemma 3 we can assume that the strings B 1 and B 2 
overlap and thus no recombination needs to be consid- 
ered before the first iteration of the algorithm. The base 
cases are thus as required. Let us assume that the claim 
holds for i + k < p + q. If the condition 



m 2 [column2 — 2] 
(m 1 [columnl — 1] 
m 1 [columnl — 1] = 



= = j and 
= j + 1 or 



(i) 



held on a previous iteration, then the minimum of the 
columns d}[q - 1] and d 2 [columnl - 1] was assigned to 
d}[q - 1]. It also means that 



m 2 [column2 — 2] + 1 = m 1 [q — l] . 



(2) 



Let 8 = 1 if A\p] * B x [q - 1] and 0 otherwise. Now the 
algorithm calculates 



^MlP] = min 



min 



min 



+ 1 



d l [q-l\ [P] 
d 2 [column2 — 1] [p] 
d\q-l][p-l] +s 
d 2 [column2 — 1] [p — 1] 
dHq] [P-1] + 1 



«I(Ap,]^_ 2] ) +i 
ed(A p ,R m2[column22] ) 

■ed(Vi^ 2] ) +s 
ed(A p _ i , R m2 [ CBlumn2 _ 2 \ ) 

ed(Vi. 



mm 
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where the first equality holds by the induction 
assumption. The second equality holds as by equation 
(2) we can use Lemma 2 with the assignment v := col- 
umn2 and k := q. 

If condition (1) did not hold on the previous iteration, 
no minimisation was performed. Thus the algorithm cal- 
culates 

d 1 [4-1] \p] + i 
d 1 [q] [p] = min { d 1 [q - 1] [p - I] + S 
d l [q] [p- 1] + 1 

ed (A p , R l m i [q -2}) + 1 

ed(A p _!, R l m 2 \q-2\) + & 

ed(A p _!, R 1 m i[ (? _i]) + 1 
= ed(A p , R 1 m i [<,_!]) 

which is correct, as no recombination could be made 
at this point. 
Diagonal doubling 

The runtime optimisation known as diagonal doubling [6] 
can be used also with Algorithm 1. The idea of the original 
algorithm is as follows. Consider checking whether the 
Levenshtein edit distance between two sequences of length 
n is at most k, where k is a given cutoff. Then it is enough 
to do computation on diagonals i - j such that \i - j\ < [kl 
2], since every change of diagonal costs 1; any path visiting 
a cell outside this diagonal zone results into edit script 
with cost more than k. With unknown cutoff, it is easy to 
see that starting with k = 1, doubling the value of k at each 
iteration, recomputing the diagonal zone, and stopping the 
doubling when the computed distance value remains 
unchanged, results in an O (dn) time algorithm, where d is 
the edit distance. The idea works with technical changes 
for two sequences of different length. With some care, one 
can perform the computation in O {d) space. 

To modify the optimisation for our haploid to diploid 
alignment, the following details need to be taken into 
account. Because the two columns calculated by the algo- 
rithm at every iteration may not completely overlap, one 
needs to make sure that the minimisation step is done cor- 
rectly. The minimum of the two columns can only be 
taken for the overlapping part. Because of the two 
Levenshtein calculations, we also need to choose the start- 
ing cutoff values with care. In particular one needs to start 
with ra«*(|length(A) - length^ 1 )!, |length(A) - length 
(B 2 )|), so that the diagonal is big enough to cover the bot- 
tom right corner in both matrices. 

The diagonal doubling does not affect the correctness of 
the Algorithm 1. The same argument as in the case of 
Levenshtein distance applies here. Let us assume that we 
are interested only on edit distances less than a certain 
cutoff value. As both matrices contain a Levenshtein dis- 
tance, it must be that the values grow monotonically when 



we deviate from the diagonal. Thus, when taking the mini- 
mum between the two columns, the values outside the 
overlapping area must be greater than the current cutoff 
value. This means that the optimum edit distance must 
derive from the values in the overlapping area. 

Results and discussion 

To test the algorithm, we implemented it in C with the 
diagonal doubling optimisation [6]. The diploid_a- 
lign algorithm was first tested with generated data. 
Two diploid strings were generated from a DNA string 
of appropriate length. Single inserts, deletes and substi- 
tutions were applied each with probability of 0.003, and 
with probability 0.5 they were applied to both diploids. 
This gives approximately k/100 variations in a string of 
length n. The algorithm was then run five times with 
the original string as the haploid input and an average 
of the runtimes was taken. The test machine was a lap- 
top with a Intel Core 2 Duo T7300 2GHz processor 
running Linux. The results are shown in the Table 1. 
The actual distance is always about half the generated 
amount of variations, as nearly half the variations are 
heterozygous and in those cases one allele has the same 
base as the haploid input (original sequence). 

The second test used artificially generated diploids 
from human chromosome 20 created for a pilot varia- 
tion calling challenge, Variathon 2013 [3]. From each of 
the two submissions taking part in the challenge we cre- 
ated one predicted haploid by a simple script that 
applied the predicted variants to the reference genome. 
These predicted haploids were then aligned to the 
diploid. Table 2 shows the original evaluations from the 
challenge (first 6 columns) and our new evaluation with 
edit distance. As can be seen, our evaluation agrees with 
the evaluation done for the challenge. 

We also ran a small experiment to highlight the issue of 
invariance. With the example given in Section Methods, 

Table 1 Results of running diploid align with 
generated data. 



input size 



runtime in seconds 



variations 



distance 



10000 


0.014000 


95 


48 


20000 


0.054000 


190 


90 


30000 


0.168000 


291 


145 


40000 


0.258000 


382 


190 


50000 


0.446000 


478 


236 


60000 


0.562000 


561 


281 


70000 


0.712000 


656 


329 


80000 


1.122000 


745 


377 


90000 


1.262000 


843 


419 


100000 


1.748000 


931 


460 


1000000 


148.619995 


10051 


4983 
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Table 2 Results with predicted haploids from Variathon. 

haploid tp fp fn precision recall edit distance runtime in minutes 

GBT 66190 323 29714 0.995 0.690 23232 690 

BoBiocomp 86589 2288 9267 0.974 0.903 17422 380 



Invariance, the evaluation method used in Variathon 2013 
[3] finds only two common variations, as claimed. Not 
surprisingly, the edit distance between the diploid and the 
haploid is zero, as in both cases the variations when 
combined produce the same genome GTCCCATAAG. 

Conclusions 

We proposed an approach for assessing variation calling 
predictions in the case of artificial diploid genomes, using 
a modification of global alignment with a diagonal dou- 
bling optimisation to compute it on large inputs. The 
motivation for the approach was to avoid the invariance 
problems of direct variation comparison. 

We tested the approach on a haploid instance to show 
its robustness and scalability to complete (small) genomes. 
We also compared the output of our algorithm with 
applicable evaluations from a variation calling challenge 
and the results agreed. 
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